17 research outputs found

    Design and implementation of a telemetry platform for high-performance computing environments

    A new generation of high-performance and distributed computing applications and services relies on adaptive and dynamic architectures and execution strategies to run efficiently, resiliently, and at scale in today’s HPC environments. These architectures require insight into their execution behaviour and the state of their execution environment at various levels of detail in order to make context-aware decisions. HPC telemetry provides this information: it is the continuous stream of time-series and event data generated on HPC systems by the hardware, operating systems, services, runtime systems, and applications. Current HPC ecosystems do not provide the conceptual models, infrastructure, and interfaces to collect, store, analyse, and integrate telemetry in a structured and efficient way. Consequently, applications and services largely depend on one-off solutions and custom-built technologies, introducing significant development overheads that inhibit portability and mobility. To facilitate a broader mix of applications, more efficient application development, and swift adoption of adaptive architectures in production, a comprehensive framework for telemetry management and analysis must be provided as part of future HPC ecosystem designs. This thesis provides the blueprint for such a framework: it proposes a new approach to telemetry management in HPC, the Telemetry Platform concept. Departing from the observation that telemetry data and the corresponding analysis and integration patterns on modern multi-tenant HPC systems closely resemble the patterns observed in large-scale data analytics or “Big Data” platforms, the telemetry platform concept takes the data platform paradigm and architectural approach and applies them to HPC telemetry. The result is the blueprint for a system that provides services for storing, searching, analysing, and integrating telemetry data in HPC applications and other HPC system services. It allows users to create and share telemetry-data-driven insights using everything from simple time-series analysis to complex statistical and machine-learning models, while hiding many of the inherent complexities of data management, such as data transport, clean-up, storage, cataloguing, and access management, and providing appropriate and scalable analytics and integration capabilities. The main contributions of this research are (1) the application of the data platform concept to HPC telemetry data management and usage; (2) a graph-based, time-variant telemetry data model that captures the structures and properties of platform and applications and in which telemetry data can be organised; (3) an architecture blueprint and prototype of a concrete implementation and integration architecture of the telemetry platform; and (4) a proposal for decoupled HPC application architectures that separate telemetry data management and feedback-control-loop logic from the core application code. First experimental results with the prototype implementation suggest that the telemetry platform paradigm can reduce overhead and redundancy in the development of telemetry-based application architectures and lower the barrier for HPC systems research and the provisioning of new, innovative HPC system services.
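    As a rough illustration of contribution (2), the sketch below models telemetry as a graph whose nodes represent platform and application entities (for example a compute node or a job), carry their own time-stamped metric streams, and are linked by relations. It is a minimal sketch only; all class, attribute, and metric names are hypothetical and not taken from the thesis, and the real model also treats the graph structure itself as time-variant.

        # Minimal sketch of a graph-based telemetry data model (hypothetical names).
        from dataclasses import dataclass, field
        from typing import Dict, List, Tuple


        @dataclass
        class MetricSeries:
            """A single telemetry time series, e.g. node power draw in watts."""
            name: str
            unit: str
            samples: List[Tuple[float, float]] = field(default_factory=list)  # (timestamp, value)

            def append(self, timestamp: float, value: float) -> None:
                self.samples.append((timestamp, value))


        @dataclass
        class Entity:
            """A graph node: a platform or application entity (compute node, job, service)."""
            entity_id: str
            kind: str                                                    # e.g. "compute_node", "job"
            metrics: Dict[str, MetricSeries] = field(default_factory=dict)
            edges: List[Tuple[str, str]] = field(default_factory=list)   # (relation, target entity_id)


        # Example: a job running on a compute node, each with its own metric streams.
        node = Entity("node-0042", "compute_node")
        node.metrics["power_w"] = MetricSeries("power_w", "W")
        node.metrics["power_w"].append(1700000000.0, 412.5)

        job = Entity("job-981", "job", edges=[("runs_on", node.entity_id)])
        job.metrics["mem_rss_gb"] = MetricSeries("mem_rss_gb", "GB")
        job.metrics["mem_rss_gb"].append(1700000000.0, 57.3)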

    Seastar: A Comprehensive Framework for Telemetry Data in HPC Environments

    A large number of 2nd generation high-performance computing applications and services rely on adaptive and dynamic architectures and execution strategies to run efficiently, resiliently, and at scale on today’s HPC infrastructures. They require information about applications and their environment to steer and optimize execution. We define this information as telemetry data. Current HPC platforms do not provide the infrastructure, interfaces, and conceptual models to collect, store, analyze, and access such data. Today, applications depend on application- and platform-specific techniques for collecting telemetry data, introducing significant development overheads that inhibit portability and mobility. The development and adoption of adaptive, context-aware strategies is thereby impaired. To facilitate 2nd generation applications, more efficient application development, and swift adoption of adaptive applications in production, a comprehensive framework for telemetry data management must be provided by future HPC systems and services. We introduce Seastar, a conceptual model and a software framework to collect, store, analyze, and exploit streams of telemetry data generated by HPC systems and their applications. We show how Seastar can be integrated with HPC platform architectures and how it enables common application execution strategies.
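    To make the collect/store/exploit pattern concrete, the sketch below shows how an application might publish a telemetry stream and how a steering component might query it to adapt execution. This is a hypothetical, in-memory stand-in for a Seastar-like service; the class and stream names are invented and do not come from the paper.

        # Hypothetical sketch of publishing and exploiting telemetry streams.
        import time
        from collections import defaultdict, deque
        from typing import Deque, Dict, Tuple


        class TelemetryStore:
            """In-memory stand-in for a telemetry collection and query service."""

            def __init__(self, window: int = 1024) -> None:
                self._streams: Dict[str, Deque[Tuple[float, float]]] = defaultdict(
                    lambda: deque(maxlen=window))

            def publish(self, stream: str, value: float) -> None:
                self._streams[stream].append((time.time(), value))

            def mean(self, stream: str, last_n: int = 10) -> float:
                samples = list(self._streams[stream])[-last_n:]
                return sum(v for _, v in samples) / max(len(samples), 1)


        store = TelemetryStore()

        # Application side: publish per-iteration runtime as a telemetry stream.
        for step in range(20):
            store.publish("app.step_runtime_s", 0.8 + 0.01 * step)

        # Steering side: exploit the stream, e.g. react when iterations slow down.
        if store.mean("app.step_runtime_s", last_n=5) > 0.9:
            print("steps are slowing down; consider rebalancing or scaling out")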

    Rethinking High Performance Computing Platforms: Challenges, Opportunities and Recommendations

    A new class of second-generation high-performance computing applications with heterogeneous, dynamic, and data-intensive properties has an extended set of requirements covering application deployment, resource allocation and control, and I/O scheduling. These requirements are not met by current production HPC platform models and policies. This results in a loss of opportunity, productivity, and innovation for new computational methods and tools. It also decreases effective system utilization for platform providers due to unsupervised workarounds and rogue resource management strategies implemented in application space. In this paper we critically discuss the dominant HPC platform model and describe the challenges it creates for second-generation applications because of its asymmetric resource view, interfaces, and software deployment policies. We present an extended, more symmetric and application-centric platform model that adds decentralized deployment, introspection, bidirectional control and information flow, and more comprehensive resource scheduling. We describe cHPC, an early prototype of a non-disruptive implementation based on Linux Containers (LXC). It can operate alongside existing batch queuing systems and exposes a symmetric platform API without interfering with existing applications and usage modes. We see our approach as a viable, incremental next step in HPC platform evolution that benefits applications and platform providers alike. To demonstrate this further, we lay out a roadmap for future research and experimental evaluation.
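    The sketch below illustrates what a symmetric, bidirectional platform API of this kind could look like from application space: the application deploys and introspects its own allocation, while the platform can push events back to the application. This is not the cHPC interface; every class, method, and event name here is invented purely for illustration.

        # Hypothetical illustration of a symmetric, bidirectional platform API.
        from dataclasses import dataclass
        from typing import Callable, Dict, List


        @dataclass
        class Allocation:
            container_id: str
            cpus: int
            memory_gb: int


        class PlatformAPI:
            """Stand-in for a container-backed platform service (e.g. on top of LXC)."""

            def __init__(self) -> None:
                self._allocations: Dict[str, Allocation] = {}
                self._callbacks: List[Callable[[str], None]] = []

            # Application -> platform: decentralized deployment and introspection.
            def deploy(self, name: str, cpus: int, memory_gb: int) -> Allocation:
                alloc = Allocation(f"lxc-{name}", cpus, memory_gb)
                self._allocations[name] = alloc
                return alloc

            def introspect(self, name: str) -> Allocation:
                return self._allocations[name]

            # Platform -> application: bidirectional control and information flow.
            def on_platform_event(self, callback: Callable[[str], None]) -> None:
                self._callbacks.append(callback)

            def notify(self, event: str) -> None:
                for cb in self._callbacks:
                    cb(event)


        platform = PlatformAPI()
        platform.deploy("solver", cpus=32, memory_gb=128)
        platform.on_platform_event(lambda e: print(f"application reacting to: {e}"))
        platform.notify("node_power_capped")      # platform informs the application
        print(platform.introspect("solver"))      # application inspects its allocation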

    SAGA: A standardized access layer to heterogeneous Distributed Computing Infrastructure

    Distributed Computing Infrastructure is characterized by interfaces that are heterogeneous, both syntactically and semantically. SAGA represents the most comprehensive community effort to date to address this heterogeneity by defining a simple, uniform access layer. In this paper, we describe the basic concepts underpinning its design and development. We also discuss RADICAL-SAGA, the most widely used implementation of SAGA.
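    A minimal job-submission sketch with RADICAL-SAGA is shown below to illustrate the uniform access layer. It is based on the documented radical.saga job API and assumes the package is installed; verify the calls against the current documentation before relying on them.

        # Minimal RADICAL-SAGA job submission sketch (assumes radical.saga is installed).
        import radical.saga as rs

        # A job service bound to the local machine; the same code can target remote
        # resources by changing the URL scheme (e.g. "ssh://" or a batch-system adaptor).
        js = rs.job.Service("fork://localhost")

        jd = rs.job.Description()
        jd.executable = "/bin/date"
        jd.output     = "saga_job.out"

        job = js.create_job(jd)
        job.run()
        job.wait()
        print("job finished with state:", job.state)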

    Federation and Interoperability Use Cases, version 1.1

    These use cases describe how scientific projects make use of resources from more than one public research computing community, and how a given community can support such projects. A public research computing community is an organized set of resources and processes that work together coherently to serve the needs of a community of researchers. Many such communities exist worldwide, such as XSEDE (www.xsede.org) and PRACE (www.prace-ri.eu). Each community has computing resources (e.g., an HPC cluster or an archival storage service) and a set of services that add value to the community’s resources (e.g., a community login service or a user support service). Although each use case describes an activity that uses resources from multiple communities, the use cases themselves are written from the point of view of a single community. The goal is to express what a single community must offer to support the activity. National Science Foundation, OCI-1053575.

    Rethinking High Performance Computing platforms: challenges, opportunities and recommendations

    A growing number of "second generation" high-performance computing applications with heterogeneous, dynamic, and data-intensive properties have an extended set of requirements covering application deployment, resource allocation and control, and I/O scheduling. These requirements are not met by current production HPC platform models and policies. This results in a loss of opportunity, productivity, and innovation for new computational methods and tools. It also decreases effective system utilization for platform providers due to unsupervised workarounds and "rogue" resource management strategies implemented in application space. In this paper we critically discuss the dominant HPC platform model and describe the challenges it creates for second generation applications because of its asymmetric resource view, interfaces, and software deployment policies. We present an extended, more symmetric and application-centric platform model that adds decentralized deployment, introspection, bidirectional control and information flow, and more comprehensive resource scheduling. We describe cHPC, an early prototype of a non-disruptive implementation based on Linux Containers (LXC). It can operate alongside existing batch queuing systems and exposes a symmetric platform API without interfering with existing applications and usage modes. We see our approach as a viable, incremental next step in HPC platform evolution that benefits applications and platform providers alike. To demonstrate this further, we lay out a roadmap for future research and experimental evaluation.